
FIGURE 5.7: Overview of the algorithm proposed in [5].

In summary, this paper's contributions are as follows: (1) new kernels for efficient and accurate integer-only GELU and Softmax, in which GELU and Softmax are approximated with lightweight second-order polynomials that can be evaluated with integer-only arithmetic; (2) integer-only LayerNorm computation, obtained by leveraging a known algorithm for the integer calculation of the square root [49]; and (3) fully integer-only quantization of language models, built on the proposed approximations of GELU and Softmax.
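To make the second-order polynomial trick concrete, the minimal NumPy sketch below evaluates a(x + b)^2 + c on quantized inputs x = S * q using only integer arithmetic, by folding the floating-point coefficients into integer constants and an output scale once ahead of time. The helper name int_poly2 and the coefficient values are illustrative assumptions; the specific polynomials used for GELU and Softmax are derived in the cited work.

```python
import numpy as np

def int_poly2(q, S, a, b, c):
    """Integer-only evaluation of a*(x + b)**2 + c for x = S * q (sketch).

    q : int32 array of quantized inputs; S : their floating-point scale.
    The float coefficients (a, b, c) are folded once into integer constants
    and an output scale, so the per-element work uses integers only.
    """
    q_b = int(np.floor(b / S))                       # integer counterpart of b
    q_c = int(np.floor(c / (a * S * S)))             # integer counterpart of c / a
    q_out = (q.astype(np.int64) + q_b) ** 2 + q_c    # pure integer arithmetic
    S_out = a * S * S                                # scale of the output
    return q_out, S_out

# Example with illustrative (not the paper's) coefficients:
q = np.array([-3, 0, 2], dtype=np.int32)
q_out, S_out = int_poly2(q, S=0.1, a=-0.3, b=-1.8, c=1.0)
print(S_out * q_out)   # dequantized result, close to a*(x + b)**2 + c
```

Because only the precomputed constants depend on the scale, the per-element work reduces to integer additions and multiplications, which is what makes the kernel integer-only.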

5.5 Toward Efficient Post-Training Quantization of Pre-Trained Language Models

Bai et al. [5] propose MREM, which aims to improve the performance of post-training quantization for language models while retaining the training efficiency, low memory overhead, and limited data requirements that post-training quantization affords. An overview of the algorithm proposed in [5] is presented in Fig. 5.7. As can be seen, the full-precision and quantized models are first partitioned into multiple modules, which are then placed on different computing devices. Each module samples input tensors from its own input queue, so it can be trained locally without waiting for its predecessors. Moreover, teacher forcing is applied to the quantized modules to mitigate the propagation of reconstruction errors.
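A rough sketch of this parallel, queue-driven training scheme is given below. It assumes a simple setting in which each full-precision/quantized module pair reads cached activations from its own queue, and teacher forcing is modelled by a hypothetical mixing probability teacher_prob that decides whether the quantized module receives the full-precision or the quantized activation; the scheduling details in [5] may differ.

```python
import random
import torch

def train_module_locally(fp_module, q_module, in_queue, out_queue,
                         optimizer, teacher_prob=0.5):
    """Local training loop for one (full-precision, quantized) module pair (sketch).

    Each queue entry holds the full-precision and quantized activations produced
    by the previous module. Because every module only reads from its own queue,
    modules on different devices can train in parallel without waiting for their
    predecessors to finish.
    """
    while True:
        item = in_queue.get()
        if item is None:                      # sentinel: no more batches
            break
        x_fp, x_q = item
        with torch.no_grad():
            y_fp = fp_module(x_fp)            # full-precision reference output
        # Teacher forcing: occasionally feed the quantized module the
        # full-precision activation to stop reconstruction errors accumulating.
        x_in = x_fp if random.random() < teacher_prob else x_q
        y_q = q_module(x_in)
        loss = torch.nn.functional.mse_loss(y_q, y_fp)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        out_queue.put((y_fp.detach(), y_q.detach()))   # feed the next module
    out_queue.put(None)
```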

5.5.1 Module-Wise Reconstruction Error Minimization

First, the language model is partitioned into multiple modules, each consisting of several transformer layers. The authors then propose module-wise reconstruction error minimization (MREM) to optimize the model weights and quantization parameters of each module, which permits sufficient optimization. Specifically, given a language model with $L$ transformer layers, the embedding layers, and the classification head, the model is partitioned into $N$ modules. Suppose the $n$-th module contains $p$ transformer layers; it then comprises the layers $l_j, l_{j+1}, \ldots, l_{j+p-1}$, with $l_j$ being the first layer of this module. MREM minimizes the joint reconstruction error between each intermediate output $\hat{f}_{l_i}$ of the quantized $n$-th module and its full-precision counterpart $f_{l_i}$ as follows:

$$\mathcal{L}_n = \sum_{i=j}^{j+p-1} \left\| \hat{f}_{l_i} - f_{l_i} \right\|^2. \qquad (5.10)$$
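The loss in Eq. (5.10) can be computed per module as in the sketch below, which assumes q_layers and fp_layers hold the $p$ transformer layers $l_j, \ldots, l_{j+p-1}$ of the quantized and full-precision module, applied to the module's input activations; the function name and calling convention are illustrative.

```python
import torch

def module_reconstruction_loss(q_layers, fp_layers, x_q, x_fp):
    """Joint reconstruction error of Eq. (5.10) for one module (sketch).

    q_layers / fp_layers: the p transformer layers l_j .. l_{j+p-1} of the
    quantized and full-precision module; x_q / x_fp: the module inputs.
    Returns the sum over layers of the squared distance between each quantized
    layer output and its full-precision counterpart.
    """
    loss = x_q.new_zeros(())
    for q_layer, fp_layer in zip(q_layers, fp_layers):
        x_q = q_layer(x_q)
        with torch.no_grad():
            x_fp = fp_layer(x_fp)
        loss = loss + (x_q - x_fp).pow(2).sum()   # || f_hat_{l_i} - f_{l_i} ||^2
    return loss
```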